AMENDMENTS TO THE CLAIMS

The Assignee submits below a complete listing of the current claims, including marked-up 
claims with insertions indicated by underlining and deletions indicated by strikeouts and/or double 
bracketing. This listing of claims replaces all prior versions and listings of claims in the application: 

1. (Currently amended) [[A]] An article of manufacture comprising a program storage device
readable by a machine, tangibly embodying a program of instructions executable by the machine to 
perform a method for speech synthesis that allows user specified pronunciations, the method 
comprising: 

providing a user interface that allows a user to identify a text string for synthesis and to 
speak a pronunciation of the text string; 

recording the user's spoken pronunciation of the text string as an audio signal; 

extracting prosodic parameter values from [[an]] the audio signal corresponding to [[a]] the
user's pronunciation of [[a]] the text string, [[by a user;]] wherein extracting the prosodic parameter
values comprises extracting duration parameter values from the audio signal by aligning the audio 
signal with the text string; 

adopting as synthesis parameter values the prosodic parameter values and duration 
parameter values extracted from the audio signal; and 

generating a synthetic speech waveform representing the text string using the synthesis 
parameter values. 

2-3. (Canceled) 

4. (Currently amended) The [[program storage device]] article of manufacture of claim 1, wherein
[[the instructions for aligning comprise instructions for]] the extracting duration parameter values by
aligning comprises segmenting the audio signal into time-segmented regions, wherein each
time-segmented region is mapped to a corresponding phoneme.

5. (Currently amended) The [[program storage device]] article of manufacture of claim 1, wherein
[[the alignment is performed]] extracting duration parameter values by aligning comprises using a
Viterbi alignment process.
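
For illustration only, the kind of alignment recited in claims 4 and 5 might be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function names, the per-frame log-likelihood table, and the 10 ms frame step are all hypothetical, and each phoneme is assumed to occupy at least one frame in left-to-right order.

```python
from typing import List

def viterbi_align(loglik: List[List[float]]) -> List[int]:
    """Return the best monotonic frame-to-phoneme path.

    loglik[t][j] is an assumed log-likelihood of frame t under the j-th
    phoneme of the text string. The path starts at phoneme 0, ends at the
    last phoneme, and at each frame either stays or advances by one.
    """
    n_frames, n_phones = len(loglik), len(loglik[0])
    if n_frames < n_phones:
        raise ValueError("need at least one frame per phoneme")
    neg_inf = float("-inf")
    score = [[neg_inf] * n_phones for _ in range(n_frames)]
    back = [[0] * n_phones for _ in range(n_frames)]
    score[0][0] = loglik[0][0]
    for t in range(1, n_frames):
        for j in range(n_phones):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else neg_inf
            if move > stay:
                best, back[t][j] = move, j - 1
            else:
                best, back[t][j] = stay, j
            if best > neg_inf:
                score[t][j] = best + loglik[t][j]
    # Trace back from the last phoneme at the last frame.
    path = [n_phones - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

def durations_ms(path: List[int], n_phones: int, frame_ms: float = 10.0) -> List[float]:
    """Convert a frame-level path into one duration value per phoneme."""
    counts = [0] * n_phones
    for j in path:
        counts[j] += 1
    return [c * frame_ms for c in counts]
```

With five frames in which the first three favor the first phoneme and the last two favor the second, the path is [0, 0, 0, 1, 1] and the durations are 30 ms and 20 ms, which is one way each time-segmented region could be mapped to a phoneme.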

6. (Canceled) 

7. (Currently amended) The [[program storage device]] article of manufacture of claim 1, wherein
the [[instructions for generating the synthetic speech waveform comprise instructions for]] adopting
comprises directly specifying at least one portion of the [[synthesis]] prosodic parameter values as
attribute values for mark-up elements.

8. (Currently amended) The [[program storage device]] article of manufacture of claim 1, wherein
the [[instructions for generating the synthetic speech waveform comprise instructions for]] adopting
comprises assigning abstract labels to at least one portion of the [[synthesis]] prosodic parameter values
to generate a high-level markup of the text string.

9. (Currently amended) The [[program storage device]] article of manufacture of claim 1, wherein
the [[generating]] adopting comprises generating a markup of the text string using SSML (speech
synthesis markup language).

10. (Currently amended) The [[program storage device]] article of manufacture of claim 1, further
comprising instructions for processing phonetic content of the audio signal to generate the synthetic 
speech waveform having a desired pronunciation. 

11. (Currently amended) A method for speech synthesis that allows user specified
pronunciations, the method comprising:

providing a user interface that allows a user to identify a text string for synthesis and to
speak a pronunciation of the text string;

recording the user's spoken pronunciation of the text string as an audio signal;

extracting prosodic parameter values from [[an]] the audio signal corresponding to [[a]] the
user's pronunciation of [[a]] the text string, [[by a user;]] wherein extracting the prosodic parameter
values comprises extracting duration parameter values from the audio signal by aligning the audio
signal with the text string;

adopting as synthesis parameter values the prosodic parameter values and duration
parameter values extracted from the audio signal; and

generating a synthetic speech waveform representing the text string using the synthesis
parameter values.

12-13. (Canceled)

14. (Previously Presented) The method of claim 11, wherein aligning comprises extracting
acoustic feature data from the audio signal and time-aligning the audio signal to the text string
using the acoustic feature data.

15. (Previously Presented) The method of claim 11, wherein aligning is performed using a
Viterbi alignment process.

16. (Canceled)

17. (Currently amended) The method of claim 11, wherein [[generating the synthetic speech
waveform]] the adopting comprises directly specifying at least one portion of the [[synthesis]] prosodic
parameter values as attribute values for mark-up elements.

18. (Currently amended) The method of claim 11, wherein [[generating the synthetic speech
waveform]] the adopting comprises assigning abstract labels to at least one portion of the [[synthesis]]
prosodic parameter values to generate a high-level markup of the text string.



Application No. 10/672,374
Reply to Office Action of August 30, 2011
Docket No.: N0484.70760USOO



19. (Currently amended) The method of claim 11, wherein the [[generating]] adopting comprises
generating a markup of the text string using SSML (speech synthesis markup language).
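
For illustration only, a markup of the kind recited in claims 17 and 19 might be produced as sketched below. This is a hedged sketch, not the applicant's implementation: the function name `to_ssml`, the per-word granularity, and the choice of the `pitch` and `duration` attributes of the SSML `prosody` element are assumptions, and the output omits the namespace and language attributes that a fully conforming SSML document would carry.

```python
import xml.etree.ElementTree as ET
from typing import Sequence

def to_ssml(words: Sequence[str],
            pitches_hz: Sequence[float],
            durations_ms: Sequence[float]) -> str:
    """Write extracted prosodic values directly as SSML attribute values.

    Each word gets its own <prosody> element whose pitch and duration
    attributes carry the (assumed) values measured from the user's recording.
    """
    if not (len(words) == len(pitches_hz) == len(durations_ms)):
        raise ValueError("one pitch and one duration are required per word")
    speak = ET.Element("speak", {"version": "1.0"})
    for word, pitch, dur in zip(words, pitches_hz, durations_ms):
        element = ET.SubElement(
            speak,
            "prosody",
            {"pitch": f"{round(pitch)}Hz", "duration": f"{round(dur)}ms"},
        )
        element.text = word
        element.tail = " "  # keep words separated in the serialized text
    return ET.tostring(speak, encoding="unicode")
```

Calling `to_ssml(["hello", "world"], [120.4, 98.0], [300, 450])` returns a `speak` document with two `prosody` elements, the first carrying `pitch="120Hz"` and `duration="300ms"`. A TTS engine that accepts SSML could then consume the string as its synthesis input.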

20. (Previously Presented) The method of claim 11, further comprising processing phonetic
content of the audio signal to generate the synthetic speech waveform having a desired 
pronunciation. 

21. (Currently amended) A text-to-speech (TTS) system that allows user specified
pronunciations, the system comprising:

[[a prosody analyzer for determining synthesis parameter values from an audio signal
corresponding to a pronunciation by a user of an input text string, wherein the prosody analyzer
comprises:]]

[[a prosodic parameter extraction module for extracting prosodic parameter
values from the audio signal,]]

[[an alignment module for extracting duration parameter values from the audio
signal by aligning the input text string with the audio signal, and]]

[[a conversion module for adopting as synthesis parameter values the prosodic
parameter values and duration parameter values extracted from the audio signal; and]]

[[a TTS engine for generating a synthetic speech waveform using the synthesis parameter]]

at least one processor; and

at least one storage device storing processor-executable instructions that, when executed by 
the at least one processor, perform a method comprising: 

providing a user interface that allows a user to identify a text string for synthesis and 
to speak a pronunciation of the text string; 

recording the user's spoken pronunciation of the text string as an audio signal; 

extracting prosodic parameter values from the audio signal corresponding to the 
user's pronunciation of the text string, wherein extracting the prosodic parameter values 
comprises extracting duration parameter values from the audio signal by aligning the audio
signal with the text string; 

adopting as synthesis parameter values the prosodic parameter values extracted from 
the audio signal; and 

generating a synthetic speech waveform representing the text string using the 
synthesis parameter values. 

22-28. (Canceled) 

29. (Currently amended) The [[program storage device]] article of manufacture of claim 33,
wherein extracting acoustic feature data from the audio signal comprises digitizing the audio signal 
into a set of frames and transforming the digitized audio signal into a set of feature vectors on a 
frame-by-frame basis. 

30. (Currently amended) The [[program storage device]] article of manufacture of claim 29,
wherein transforming the digitized audio signal comprises producing a 24-dimensional cepstra
feature vector for every 10ms of the audio signal, concatenating frames to the left and to the right of
a current frame to augment a current cepstral vector, and reducing each augmented cepstral vector to
a 60-dimensional feature vector using linear discriminant analysis.
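
For illustration only, the frame concatenation and dimensionality reduction recited in claim 30 might be sketched as follows. This is a sketch under stated assumptions, not the applicant's implementation: the claim does not state how many neighboring frames are concatenated, so the context of 4 frames per side (giving 9 x 24 = 216 dimensions) is hypothetical, as are the function names and the repetition of edge frames at the signal boundaries. The projection matrix is assumed to have been estimated beforehand by linear discriminant analysis; estimating it is outside this sketch.

```python
from typing import List

Vector = List[float]

def splice_frames(frames: List[Vector], context: int = 4) -> List[Vector]:
    """Augment each cepstral vector with its left and right neighbors.

    Frames beyond either end of the signal are replaced by the nearest
    edge frame (an assumption made here for simplicity).
    """
    n_frames = len(frames)
    spliced = []
    for t in range(n_frames):
        augmented: Vector = []
        for k in range(t - context, t + context + 1):
            k = min(max(k, 0), n_frames - 1)
            augmented.extend(frames[k])
        spliced.append(augmented)
    return spliced

def project(vectors: List[Vector], matrix: List[Vector]) -> List[Vector]:
    """Reduce each augmented vector with a precomputed projection matrix.

    matrix has one row per output dimension (60 rows in the claimed case).
    """
    return [
        [sum(w * x for w, x in zip(row, vec)) for row in matrix]
        for vec in vectors
    ]
```

Under these assumptions, splicing 24-dimensional frames produces 216-dimensional augmented vectors, and projecting them with a 60 x 216 matrix yields one 60-dimensional feature vector per 10 ms frame.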

31. (Previously Presented) The method of claim 34, wherein extracting acoustic feature data
from the audio signal comprises digitizing the audio signal into a set of frames and transforming the 
digitized audio signal into a set of feature vectors on a frame-by-frame basis. 

32. (Previously Presented) The method of claim 31, wherein transforming the digitized audio
signal comprises producing a 24-dimensional cepstra feature vector for every 10ms of the audio
signal, concatenating frames to the left and to the right of a current frame to augment a current
cepstral vector, and reducing each augmented cepstral vector to a 60-dimensional feature vector
using linear discriminant analysis.

33. (Currently amended) The [[program storage device]] article of manufacture of claim 1, wherein
the method further comprises extracting acoustic feature data from the audio signal and wherein the
aligning further comprises outputting [[a set of]] one or more duration contours.

34. (Currently amended) The method of claim 11, further comprising extracting acoustic feature
data from the audio signal and wherein the aligning further comprises outputting [[a set of]] one or
more duration contours.

35. (Currently amended) The system of claim 21, wherein the [[prosody analyzer]] method further
comprises [[an acoustic feature extraction module that extracts]] extracting acoustic feature data from
the audio signal and wherein [[the alignment module uses said acoustic feature data to perform]] the
aligning comprises outputting one or more duration contours.

36. (New) The article of manufacture of claim 1, wherein the adopting comprises directly 
specifying at least one portion of the extracted prosodic parameter values as prosodic parameter 
values for synthesis of the synthetic speech waveform representing the text string. 

37. (New) The method of claim 11, wherein the adopting comprises directly specifying at least 
one portion of the extracted prosodic parameter values as prosodic parameter values for synthesis of 
the synthetic speech waveform representing the text string. 